Modern autonomous driving system is characterized as modular tasks in sequential order, i.e., perception, prediction and planning. As sensors and hardware get improved, there is trending popularity to devise a system that can perform a wide diversity of tasks to fulfill higher-level intelligence. Contemporary approaches resort to either deploying standalone models for individual tasks, or designing a multi-task paradigm with separate heads. These might suffer from accumulative error or negative transfer effect. Instead, we argue that a favorable algorithm framework should be devised and optimized in pursuit of the ultimate goal, i.e. planning of the self-driving-car. Oriented at this goal, we revisit the key components within perception and prediction. We analyze each module and prioritize the tasks hierarchically, such that all these tasks contribute to planning (the goal). To this end, we introduce Unified Autonomous Driving (UniAD), the first comprehensive framework up-to-date that incorporates full-stack driving tasks in one network. It is exquisitely devised to leverage advantages of each module, and provide complementary feature abstractions for agent interaction from a global perspective. Tasks are communicated with unified query design to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of using such a philosophy is proven to surpass previous state-of-the-arts by a large margin in all aspects. The full suite of codebase and models would be available to facilitate future research in the community.
translated by 谷歌翻译
在鸟眼中学习强大的表现(BEV),以进行感知任务,这是趋势和吸引行业和学术界的广泛关注。大多数自动驾驶算法的常规方法在正面或透视视图中执行检测,细分,跟踪等。随着传感器配置变得越来越复杂,从不同的传感器中集成了多源信息,并在统一视图中代表功能至关重要。 BEV感知继承了几个优势,因为代表BEV中的周围场景是直观和融合友好的。对于BEV中的代表对象,对于随后的模块,如计划和/或控制是最可取的。 BEV感知的核心问题在于(a)如何通过从透视视图到BEV来通过视图转换来重建丢失的3D信息; (b)如何在BEV网格中获取地面真理注释; (c)如何制定管道以合并来自不同来源和视图的特征; (d)如何适应和概括算法作为传感器配置在不同情况下各不相同。在这项调查中,我们回顾了有关BEV感知的最新工作,并对不同解决方案进行了深入的分析。此外,还描述了该行业的BEV方法的几种系统设计。此外,我们推出了一套完整的实用指南,以提高BEV感知任务的性能,包括相机,激光雷达和融合输入。最后,我们指出了该领域的未来研究指示。我们希望该报告能阐明社区,并鼓励对BEV感知的更多研究。我们保留一个活跃的存储库来收集最新的工作,并在https://github.com/openperceptionx/bevperception-survey-recipe上提供一包技巧的工具箱。
translated by 谷歌翻译
基于骨架的人类行动识别是由于其复杂的动态而是一项长期挑战。动态的一些细颗粒细节在分类中起着至关重要的作用。现有的工作主要集中在设计带有更复杂的相邻矩阵的增量神经网络上,以捕获关节关系的细节。但是,他们仍然很难区分具有广泛相似运动模式但属于不同类别的动作。有趣的是,我们发现运动模式上的细微差异可以显着放大,并且可以轻松地通过指定的视图方向来区分观众,在这些方向上,该属性以前从未得到充分探索。与以前的工作截然不同,我们通过提出一种概念上简单而有效的多视图策略来提高性能,该策略从一系列动态视图功能中识别动作。具体而言,我们设计了一个新颖的骨骼锚定建议(SAP)模块,该模块包含一个多头结构来学习一组视图。为了学习不同观点的特征学习,我们引入了一个新的角度表示,以在不同视图下的动作转换并将转换归因于基线模型。我们的模块可以与现有的动作分类模型无缝合作。与基线模型合并,我们的SAP模块在许多具有挑战性的基准上展示了明显的性能增长。此外,全面的实验表明,我们的模型始终击败了最新的实验,并且在处理损坏的数据时保持有效和健壮。相关代码将在https://github.com/ideal-idea/sap上提供。
translated by 谷歌翻译
医学成像的病变分割是临床研究中的一个重要课题。研究人员提出了各种检测和分段算法来解决这项任务。最近,基于深度学习的方法显着提高了传统方法的性能。然而,大多数最先进的深度学习方法需要手动设计多个网络组件和培训策略。在本文中,我们提出了一种新的自动化机器学习算法T-Automl,不仅搜索最佳神经结构,而且还可以同时找到超参数和数据增强策略的最佳组合。该方法采用现代变压器模型,引入了适应搜索空间嵌入的动态长度,并且可以显着提高搜索能力。我们在几个大型公共病变分割数据集上验证T-Automl并实现最先进的性能。
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译
In neural architecture search (NAS) methods based on latent space optimization (LSO), a deep generative model is trained to embed discrete neural architectures into a continuous latent space. In this case, different optimization algorithms that operate in the continuous space can be implemented to search neural architectures. However, the optimization of latent variables is challenging for gradient-based LSO since the mapping from the latent space to the architecture performance is generally non-convex. To tackle this problem, this paper develops a convexity regularized latent space optimization (CR-LSO) method, which aims to regularize the learning process of latent space in order to obtain a convex architecture performance mapping. Specifically, CR-LSO trains a graph variational autoencoder (G-VAE) to learn the continuous representations of discrete architectures. Simultaneously, the learning process of latent space is regularized by the guaranteed convexity of input convex neural networks (ICNNs). In this way, the G-VAE is forced to learn a convex mapping from the architecture representation to the architecture performance. Hereafter, the CR-LSO approximates the performance mapping using the ICNN and leverages the estimated gradient to optimize neural architecture representations. Experimental results on three popular NAS benchmarks show that CR-LSO achieves competitive evaluation results in terms of both computational complexity and architecture performance.
translated by 谷歌翻译
当前的端到端自动驾驶方法要么基于计划的轨迹运行控制器,要么直接执行控制预测,这已经跨越了两条单独研究的研究线。本文看到了它们彼此的潜在相互利益,主动探讨了这两个发展良好的世界的结合。具体而言,我们的集成方法分别有两个用于轨迹计划和直接控制的分支。轨迹分支可以预测未来的轨迹,而控制分支则涉及一种新颖的多步预测方案,以便可以将当前动作与未来状态之间的关系进行推理。连接了两个分支,因此控制分支在每个时间步骤中从轨迹分支接收相应的指导。然后将来自两个分支的输出融合以实现互补的优势。我们的结果在闭环城市驾驶环境中进行了评估,并使用CARLA模拟器具有挑战性的情况。即使有了单眼相机的输入,建议的方法在官方Carla排行榜上排名第一$,超过了其他具有多个传感器或融合机制的复杂候选人。源代码和数据将在https://github.com/openperceptionx/tcp上公开提供。
translated by 谷歌翻译
Modern object detectors have taken the advantages of backbone networks pre-trained on large scale datasets. Except for the backbone networks, however, other components such as the detector head and the feature pyramid network (FPN) remain trained from scratch, which hinders fully tapping the potential of representation models. In this study, we propose to integrally migrate pre-trained transformer encoder-decoders (imTED) to a detector, constructing a feature extraction path which is ``fully pre-trained" so that detectors' generalization capacity is maximized. The essential differences between imTED with the baseline detector are twofold: (1) migrating the pre-trained transformer decoder to the detector head while removing the randomly initialized FPN from the feature extraction path; and (2) defining a multi-scale feature modulator (MFM) to enhance scale adaptability. Such designs not only reduce randomly initialized parameters significantly but also unify detector training with representation learning intendedly. Experiments on the MS COCO object detection dataset show that imTED consistently outperforms its counterparts by $\sim$2.4 AP. Without bells and whistles, imTED improves the state-of-the-art of few-shot object detection by up to 7.6 AP. Code is available at https://github.com/LiewFeng/imTED.
translated by 谷歌翻译
Machine learning (ML) has been increasingly used to aid aerodynamic shape optimization (ASO), thanks to the availability of aerodynamic data and continued developments in deep learning. We review the applications of ML in ASO to date and provide a perspective on the state-of-the-art and future directions. We first introduce conventional ASO and current challenges. Next, we introduce ML fundamentals and detail ML algorithms that have been successful in ASO. Then, we review ML applications to ASO addressing three aspects: compact geometric design space, fast aerodynamic analysis, and efficient optimization architecture. In addition to providing a comprehensive summary of the research, we comment on the practicality and effectiveness of the developed methods. We show how cutting-edge ML approaches can benefit ASO and address challenging demands, such as interactive design optimization. Practical large-scale design optimizations remain a challenge because of the high cost of ML training. Further research on coupling ML model construction with prior experience and knowledge, such as physics-informed ML, is recommended to solve large-scale ASO problems.
translated by 谷歌翻译
我们提出了一种从有限的训练数据学习高维参数映射的解析替代框架。在许多需要重复查询复杂计算模型的许多应用中出现了对参数代理的需求。这些应用包括贝叶斯逆问题,最佳实验设计和不确定度的最佳设计和控制等“外环”问题,以及实时推理和控制问题。许多高维参数映射承认低维结构,这可以通过映射信息的输入和输出的绘图信息的减少基础来利用。利用此属性,我们通过自适应地构造其输入和输出的缩小基础之间的Reset近似来制定用于学习这些地图的低维度近似的框架。最近的近似近似理论作为控制流的离散化,我们证明了我们所提出的自适应投影Reset框架的普遍近似性,这激励了Resnet构造的相关迭代算法。该策略代表了近似理论和算法的汇合,因为两者都使用顺序最小化流量。在数值例子中,我们表明,在训练数据少量的培训数据中,能够实现显着高精度,使其能够实现培训数据生成的最小计算投资的理想代理策略。
translated by 谷歌翻译